Face Mask Detection

  • Hazem Fakhry
  • Adham Meligy

Problem Statement

During the COVID-19 pandemic, medical face masks became increasingly important in combating the rapid spread of the virus. As a result, organizations and governments began requiring individuals to wear face masks in numerous places, occasions, and circumstances, and it falls to software engineers to automate the process of checking for face masks. In this proposed project, we will create a face mask detection system that outputs whether a person is wearing a face mask, and also identifies whether the mask is worn properly. Such a system can be of high value in numerous ways, such as denying entry to, or enforcing fines on, unmasked individuals.

Dataset

For our dataset, we merged considerable parts of several existing datasets to create a combined, balanced dataset of 6.03 GB containing 20,000+ images of correctly masked, incorrectly masked, and unmasked human faces. The datasets we used to form our combined dataset were:

  • Human Faces by Ashwin Gupta
  • MaskedFace-Net (CMFD)
  • MaskedFace-Net (IMFD)

We divided our dataset evenly between the three classes and included images of different ethnicities, ages, and genders in order to ensure balanced training and validation sets.
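As a rough sketch of how such a balanced train/validation split can be built (the file paths, class names, and truncate-to-smallest-class rule below are illustrative assumptions, not our exact pipeline):

```python
import random

def balanced_split(paths_by_class, val_ratio=0.2, seed=42):
    """Shuffle each class's image paths and split them into train/validation.

    Every class is truncated to the size of the smallest class so the
    resulting dataset is evenly balanced across all classes.
    """
    rng = random.Random(seed)
    smallest = min(len(p) for p in paths_by_class.values())
    split = {}
    for label, paths in paths_by_class.items():
        paths = list(paths)
        rng.shuffle(paths)
        paths = paths[:smallest]          # balance: same count per class
        n_val = int(len(paths) * val_ratio)
        split[label] = {"val": paths[:n_val], "train": paths[n_val:]}
    return split

# Hypothetical file lists for the three classes
data = {
    "correct":   [f"cmfd/{i}.jpg" for i in range(7000)],
    "incorrect": [f"imfd/{i}.jpg" for i in range(7200)],
    "unmasked":  [f"faces/{i}.jpg" for i in range(6900)],
}
split = balanced_split(data)
```

With a 0.2 validation ratio, every class ends up with the same number of images, split 4:1 between training and validation.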



Input/Output Examples

Our proposed system takes an image of a person as input and outputs one of three classes: wearing a mask correctly, wearing a mask incorrectly, or not wearing a mask at all.



State of the art

CNN models are widely considered the go-to models for image classification problems. Below is a table comparing different CNN architectures by accuracy, model size, and training time.



Original Model from Literature

CNN stands for Convolutional Neural Network, a neural network specialized for processing data with a grid-like input shape, such as the 2D pixel matrix of an image. A CNN builds a network of neurons connected in a 2D structure, from which the output is processed and produced. Because CNNs operate on this grid structure, they are typically used for image detection and classification and are well suited to computer vision applications.




The model uses a pre-trained ResNet-50 Convolutional Neural Network and applies transfer learning to learn the weights of only the last layer of the net. Transfer learning makes use of a pre-trained model to give better generalization while saving training time by requiring fewer epochs. ResNet-50 is a CNN with a depth of 50 layers; it is a modern convolutional architecture that allows for deeper networks, includes pooling, and runs on the TensorFlow framework. This model achieved an accuracy of 98% on the Face Mask Detection dataset, a set of 7,000+ images labeled as withmask and withoutmask.
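A minimal Keras sketch of this transfer-learning setup: the ResNet-50 backbone is frozen and only a new final layer is trained. The input size is a common assumption, and `weights=None` is used here to avoid a download; `weights="imagenet"` would load the pre-trained ImageNet weights the literature model relies on.

```python
from tensorflow import keras

# ResNet-50 backbone without its original classification head
base = keras.applications.ResNet50(
    weights=None,              # use weights="imagenet" in practice
    include_top=False,
    input_shape=(224, 224, 3),
    pooling="avg",             # global average pooling -> 2048-d features
)
base.trainable = False         # freeze all pre-trained weights

# Only this final layer is trained (binary: withmask / withoutmask)
inputs = keras.Input(shape=(224, 224, 3))
features = base(inputs, training=False)
outputs = keras.layers.Dense(1, activation="sigmoid")(features)
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```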



Proposed Updates

Update #1: Multinomial Classification

All of the above models produce only binary outputs (wearing a mask / not wearing a mask). Our proposed update is to create a model that also identifies whether a mask is worn properly, in addition to the two base outputs. This means our tentative model has three possible outputs: mask worn properly, mask worn improperly, and mask not worn.
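Concretely, moving from binary to multinomial classification means replacing the single sigmoid output with a three-unit softmax head and switching to categorical cross-entropy. A minimal sketch, assuming pooled 2048-dimensional features from a frozen backbone (the class names are illustrative):

```python
from tensorflow import keras

class_names = ["mask_correct", "mask_incorrect", "no_mask"]

# Head only: takes pooled backbone features, emits three class probabilities
inputs = keras.Input(shape=(2048,))
outputs = keras.layers.Dense(len(class_names), activation="softmax")(inputs)
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```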

Update #2: Using Ensemble Learning

Since several CNN architectures exist that give valid, accurate results, we decided to use ensemble learning. We ensemble three CNN architectures in the hope of achieving better predictive performance than any single architecture alone. The three CNN models we intended to use were:

  • ResNet-50
  • DenseNet-201
  • VGG-16

These models were chosen because they provide high accuracy while having a small model size and a relatively low training time.
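The averaging of model outputs described in the Results section is known as soft voting; it can be sketched as follows (the probability values are hypothetical):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Soft voting: average the per-class probabilities of several models."""
    return np.mean(prob_list, axis=0)

# Hypothetical softmax outputs of the three member models for one image
resnet   = np.array([[0.90, 0.07, 0.03]])
densenet = np.array([[0.80, 0.15, 0.05]])
vgg      = np.array([[0.85, 0.10, 0.05]])

avg = ensemble_predict([resnet, densenet, vgg])
pred = avg.argmax(axis=1)  # index of the most likely class
```

Averaging smooths out individual models' mistakes: an image that one model misclassifies can still be predicted correctly if the other two are confident.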



Update #3: Using a Larger Training Dataset

In the original model from the literature, a relatively small two-class dataset of 174.7 MB was used. To train our model for better generalization, we intend to use a much larger 6.03 GB dataset that includes all three classes of images.

Update #4: Early Stopping and Callbacks

To prevent overfitting and ensure better generalization, we used early stopping while training our models; this is especially useful since training is iterative. In addition, callbacks helped ensure that we keep the best possible result from our training.
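A sketch of how such callbacks can be configured in Keras (the monitored metric, patience value, and checkpoint filename are illustrative choices, not necessarily our exact settings):

```python
from tensorflow import keras

callbacks = [
    # Stop training once validation loss stops improving,
    # and roll back to the best weights seen so far
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True
    ),
    # Save a checkpoint only when the model improves on its best score
    keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss", save_best_only=True
    ),
]

# Passed to training, e.g.:
# model.fit(train_ds, validation_data=val_ds, epochs=4, callbacks=callbacks)
```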


Update #5: Data Augmentation

To further prevent overfitting, we applied data augmentation to our dataset, such as image rotation and zoom. This technique acts as a regularizer, increasing the data available to model training by modifying the inputs or generating new samples.
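A sketch of on-the-fly rotation and zoom augmentation using Keras preprocessing layers (the rotation and zoom factors are illustrative):

```python
import numpy as np
from tensorflow import keras

# Random rotation and zoom applied on the fly during training
augment = keras.Sequential([
    keras.layers.RandomRotation(0.06),  # up to roughly ±20 degrees
    keras.layers.RandomZoom(0.15),      # zoom in/out by up to 15%
])

# Applied to a dummy batch: shapes are preserved, but each epoch
# sees a slightly different version of every image
batch = np.random.rand(4, 224, 224, 3).astype("float32")
augmented = augment(batch, training=True)
```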


Results

In the ensemble, we used three CNN architectures that produced similarly accurate results on our dataset. We used several metrics to judge the performance of our models: training loss, validation loss, and validation accuracy. Below are tables showing the performance of each architecture before ensembling. The DenseNet model provided the highest accuracy of the three, and since their outputs are averaged together, our final model should give even better predictive performance. These figures improve on the original architecture, which had an accuracy of only 98%, whereas our models exceeded 99%.

Model #1: ResNet-50

Model #2: DenseNet-201

Model #3: VGG-16



In addition, confusion matrices were used for each model architecture to track the models' predictions relative to the ground truth. As shown in the figures below, all three models gave very accurate predictions with only slight variation in accuracy.

Model #1: ResNet-50

Model #2: DenseNet-201

Model #3: VGG-16

Technical report

  • Training Frameworks: fast.ai, Keras
  • Epochs Per Model: 4 epochs
  • Total Epochs: 12 epochs
  • Time Per Epoch: ~14 minutes
  • Total Training Time: ~168 minutes
  • Batch Size: 32
  • Ratio of Training:Validation Data: 4:1
  • Difficulties Faced:
    • Slow internet connection in Egypt made uploading a dataset very difficult
    • GPU runtime disconnecting in Colab due to reaching usage limits

Conclusion

We are very pleased with the high accuracy and performance our model has reached. Over the course of the project, we learned a lot more about the practical implementation of deep learning models and were exposed to things we would not have known otherwise. We had to make frequent changes to the model due to the difficulties we faced at different milestones, which pushed us to research different solutions and come up with more creative ones.

Lessons Learned

  • New frameworks such as fast.ai
  • Checkpoints and callbacks
  • Different data loading techniques
  • Ensemble learning

Future Plans

  • Training the model to classify coloured masks in addition to medical face masks